About Data Analysis Report

This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Wed Sep 25 21:18:11 2024.

Data Description:

This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Relevant Paper:

Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg

Task One: Load and explore the data

Load data and install packages

## Import required packages
if (!require(tidyverse)) install.packages("tidyverse")
## Loading required package: tidyverse
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if (!require(lubridate)) install.packages("lubridate")
if (!require(timetk)) install.packages("timetk")
## Loading required package: timetk
## Warning: package 'timetk' was built under R version 4.3.3
if (!require(forecast)) install.packages("forecast")
## Loading required package: forecast
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
# Load the packages
library(tidyverse)
library(lubridate)
library(timetk)
library(forecast)

# Load the data
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip"
download.file(url, destfile = "bike_data.zip")
unzip("bike_data.zip")

# Read the dataset (day.csv file)
bike_data <- read.csv("day.csv")

# View the first few rows of the data
head(bike_data)
##   instant     dteday season yr mnth holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1       0       6          0          2
## 2       2 2011-01-02      1  0    1       0       0          0          2
## 3       3 2011-01-03      1  0    1       0       1          1          1
## 4       4 2011-01-04      1  0    1       0       2          1          1
## 5       5 2011-01-05      1  0    1       0       3          1          1
## 6       6 2011-01-06      1  0    1       0       4          1          1
##       temp    atemp      hum windspeed casual registered  cnt
## 1 0.344167 0.363625 0.805833 0.1604460    331        654  985
## 2 0.363478 0.353739 0.696087 0.2485390    131        670  801
## 3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349
## 4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562
## 5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600
## 6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606

Describe and explore the data

# Summary statistics
summary(bike_data)
##     instant         dteday              season            yr        
##  Min.   :  1.0   Length:731         Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   :366.0                      Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0                      Max.   :4.000   Max.   :1.0000  
##       mnth          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##    weathersit         temp             atemp              hum        
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##    windspeed           casual         registered        cnt      
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714
# Check for missing values
sum(is.na(bike_data))
## [1] 0
# View data structure
str(bike_data)
## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
# Convert date column to Date type
bike_data$dteday <- as.Date(bike_data$dteday)

# Plot rental count by date
ggplot(bike_data, aes(x = dteday, y = cnt)) +
  geom_line() +
  labs(title = "Daily Bike Rentals Over Time", x = "Date", y = "Number of Rentals")

Task Two: Create interactive time series plots

## Read about the timetk package
# ?timetk
# Interactive time series plot using timetk
bike_data %>%
  plot_time_series(dteday, cnt, .interactive = TRUE, .title = "Interactive Plot of Daily Bike Rentals")

Task Three: Smooth time series data

# Smooth the time series data using moving averages
bike_data %>%
  mutate(cnt_smooth = forecast::ma(cnt, order = 7)) %>%
  ggplot(aes(x = dteday)) +
  geom_line(aes(y = cnt), color = "blue", alpha = 0.5) +
  geom_line(aes(y = cnt_smooth), color = "red") +
  labs(title = "Smoothed Bike Rentals with 7-Day Moving Average", y = "Number of Rentals", x = "Date")
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_line()`).

Task Four: Decompse and access the stationarity of time series data

# Install tseries package if not installed
if (!require(tseries)) install.packages("tseries")
## Loading required package: tseries
## Warning: package 'tseries' was built under R version 4.3.3
# Load the package
library(tseries)

# Decompose time series data
bike_ts <- ts(bike_data$cnt, frequency = 365)
decomp <- decompose(bike_ts)

# Plot the decomposition
plot(decomp)

# Test for stationarity using Augmented Dickey-Fuller test
adf_test_result <- adf.test(bike_ts)

# View ADF test result
print(adf_test_result)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  bike_ts
## Dickey-Fuller = -1.6351, Lag order = 9, p-value = 0.7327
## alternative hypothesis: stationary

Task Five: Fit and forecast time series data using ARIMA models

# Fit ARIMA model
fit <- auto.arima(bike_ts)

# Forecast future rentals for the next 30 days
forecast_vals <- forecast(fit, h = 30)

# Plot the forecast
autoplot(forecast_vals) +
  labs(title = "Bike Rental Forecast for Next 30 Days", x = "Date", y = "Number of Rentals")

Task Six: Findings and Conclusions


Key Steps in the Project:

  1. Data Loading & Cleaning: The dataset is loaded from the UCI repository and cleaned by converting the date to a proper format and handling missing values.

  2. Data Exploration: Summary statistics and time series plots give insights into the overall pattern of bike rentals over time.

  3. Smoothing: A 7-day moving average is used to smooth the fluctuations in daily bike rental counts.

  4. Stationarity Test & Decomposition: The time series is decomposed to view trends, seasonality, and residuals. An Augmented Dickey-Fuller test is used to assess stationarity.

  5. ARIMA Model: An ARIMA model is fitted to forecast future bike rentals, and a 30-day forecast is made to assist with business decisions.

  6. Conclusions: Business insights based on the analysis and recommendations for fleet and pricing optimizations.